{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 模式匹配与正则表达式笔记（第8章）\n",
    "正则表达式，又称规则表达式。（英语：Regular Expression，在代码中常简写为regex、regexp或RE），计算机科学的一个概念。正则表达式通常被用来检索、替换那些符合某个模式(规则)的文本。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.1 正则表达式匹配步骤**  \n",
    "虽然在Python 中使用正则表达式有几个步骤，但每一步都相当简单。  \n",
    "1. 用import re 导入正则表达式模块。  \n",
    "2. 用re.compile()函数创建一个Regex 对象（记得使用原始字符串）。  \n",
    "3. 向Regex 对象的search()方法传入想查找的字符串。它返回一个Match 对象。  \n",
    "4. 调用Match 对象的group()方法，返回实际匹配文本的字符。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phone num found is: 415-555-4242\n"
     ]
    }
   ],
   "source": [
    "# 用import re 导入正则表达式模块\n",
    "import re\n",
    "# 创建了一个Regex对象\n",
    "phoneNumRegex = re.compile(r'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')\n",
    "# 匹配字符串\n",
    "mo = phoneNumRegex.search('My number is 415-555-4242.')\n",
    "# group()方法返回实际匹配的字符串\n",
    "print('phone num found is: ' + mo.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.2 正则表达式匹配更多模式**  \n",
    "1. 利用括号分组  \n",
    "添加括号将在正则表达式中创建“分组”(\\d\\d\\d)-(\\d\\d\\d-\\d\\d\\d\\d)。然后可以使用group()匹配对象方法，从一个分组中获取匹配的文本。\n",
    "2. 用管道匹配多个分组  \n",
    "字符|称为“管道”。希望匹配许多表达式中的一个时，就可以使用它。例如，正则表达式r'Batman|Tina Fey'将匹配'Batman'或'Tina Fey'。\n",
    "3. 用问号实现可选匹配  \n",
    "有时候，想匹配的模式是可选的。就是说，不论这段文本在不在，正则表达式都会认为匹配。字符?表明它前面的分组在这个模式中是可选的。也可以理解为匹配这个问号之前的分组零次或一次。\n",
    "4. 用星号匹配零次或多次  \n",
    "*（称为星号）意味着“匹配零次或多次”，即星号之前的分组，可以在文本中出现任意次。它可以完全不存在，或一次又一次地重复。\n",
    "5. 用加号匹配一次或多次  \n",
    "加号+前面的分组必须“至少出现一次\n",
    "6. 用花括号匹配特定次数     \n",
    "    + 如果想要一个分组重复特定次数，就在正则表达式中该分组的后面，跟上花括号包围的数字。例如，正则表达式(Ha){3}将匹配字符串'HaHaHa'。  \n",
    "    + 除了一个数字，还可以指定一个范围，即在花括号中写下一个最小值、一个逗号和一个最大值。例如，正则表达式(Ha){1,2}将匹配 'Ha' 和 'HaHa'。\n",
    "    + 也可以不写花括号中的第一个或第二个数字，不限定最小值或最大值。例如，(Ha){3,}将匹配3 次或更多次实例。\n",
    "7. 用花括号和问号匹配非贪心模式  \n",
    "Python 的正则表达式默认是“贪心”的，这表示在有二义的情况下，它们会尽可能匹配最长的字符串。花括号的“非贪心”版本匹配尽可能最短的字符串，即在\n",
    "结束的花括号后跟着一个问号。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.3 findall()方法**  \n",
    "search()将返回一个Match对象，包含被查找字符串中的“第一次”匹配的文本，而findall()方法将返回一组字符串，包含被查找字符串中的所有匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'415-555-9999'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "phoneNumRegex = re.compile(r'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')\n",
    "# search返回第一次匹配\n",
    "mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')\n",
    "mo.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['415-555-9999', '212-555-0000']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# findall返回所有结果\n",
    "phoneNumRegex = re.compile(r'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')\n",
    "phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.4 字符分类**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|缩写字符分类|表示|\n",
    "|-|-|\n",
    "|\\d|0 到9 的任何数字|\n",
    "|\\D|除0 到9 的数字以外的任何字符|\n",
    "|\\w |任何字母、数字或下划线字符（可以认为是匹配“单词”字符）|\n",
    "|\\W| 除字母、数字和下划线以外的任何字符|\n",
    "|\\s| 空格、制表符或换行符（可以认为是匹配“空白”字符）|\n",
    "|\\S| 除空格、制表符和换行符以外的任何字符|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.5 建立自己的字符分类**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 有时候你想匹配一组字符，但缩写的字符分类（\\d、\\w、\\s 等）太宽泛。你可\n",
    "以用方括号定义自己的字符分类。例如，字符分类[aeiouAEIOU]将匹配所有元音字\n",
    "符，不论大小写。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "vowelRegex = re.compile(r'[aeiouAEIOU]')\n",
    "vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 可以使用短横表示字母或数字的范围。例如，字符分类[a-zA-Z0-9]将匹配所有小写字母、大写字母和数字。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['R',\n",
       " 'o',\n",
       " 'b',\n",
       " 'o',\n",
       " 'C',\n",
       " 'o',\n",
       " 'p',\n",
       " 'e',\n",
       " 'a',\n",
       " 't',\n",
       " 's',\n",
       " 'b',\n",
       " 'a',\n",
       " 'b',\n",
       " 'y',\n",
       " 'f',\n",
       " 'o',\n",
       " 'o',\n",
       " 'd',\n",
       " 'B',\n",
       " 'A',\n",
       " 'B',\n",
       " 'Y',\n",
       " 'F',\n",
       " 'O',\n",
       " 'O',\n",
       " 'D']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "vowelRegex = re.compile(r'[a-zA-Z0-9]')\n",
    "vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 通过在字符分类的左方括号后加上一个插入字符（^），就可以得到“非字符类”。非字符类将匹配不在这个字符类中的所有字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['R',\n",
       " 'b',\n",
       " 'C',\n",
       " 'p',\n",
       " ' ',\n",
       " 't',\n",
       " 's',\n",
       " ' ',\n",
       " 'b',\n",
       " 'b',\n",
       " 'y',\n",
       " ' ',\n",
       " 'f',\n",
       " 'd',\n",
       " '.',\n",
       " ' ',\n",
       " 'B',\n",
       " 'B',\n",
       " 'Y',\n",
       " ' ',\n",
       " 'F',\n",
       " 'D',\n",
       " '.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "consonantRegex = re.compile(r'[^aeiouAEIOU]')\n",
    "consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 插入字符和美元字符  \n",
    "正则表达式的开始处使用插入符号（^），表明匹配必须发生在被查找文本开始处。\n",
    "正则表达式的末尾加上美元符号，表示该字符串必须以这个正则表达式的模式结束。可以同时使用^和$，表明整个字符串必须匹配该模式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.6 通配字符**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. .(句点)字符称为“通配符”。它匹配除了换行之外的所有字符。\n",
    "2. 用点-星（.*）表示“任意文本”。\n",
    "3. 通过传入re.DOTALL 作为re.compile()的第二个参数，可以让句点字符匹配所有字符，包括换行字符。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.7 正则表达式符号总结**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](https://gitee.com/luminious/article_picture_warehouse/raw/master/Python-Study-Notes/7.10.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.8 sub() 替换字符串**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regex对象的sub()方法需要传入两个参数。第一个参数是一个字符串，用于取代发现的匹配。第二个参数是一个字符串，即正则表达式。sub()方法返回替换完成后的字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CENSORED gave the secret documents to CENSORED.'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "namesRegex = re.compile(r'Agent \\w+')\n",
    "namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.9 compile的第二个参数**\n",
    "1. re.IGNORECASE 忽视大小写\n",
    "2. re.VERBOSE 编写注释\n",
    "3. re.DOTALL 句点表示所有字符（没有这个则表示除换行符外的所有字符）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 项目练习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 电话号码和E-mail 地址提取程序 \n",
    "功能：在剪贴板的文本中查找电话号码和E-mail 地址，按一下Ctrl-A 选择所有文本，按下Ctrl-C 将它复制到剪贴板，然后运行程序。找到电话号码和E-mail地址，替换掉剪贴板中的文本。  \n",
    "文本例子地址：http://www.nostarch.com/contactus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**主要步骤:**  \n",
    "第1步：为电话号码创建一个正则表达式  \n",
    "第2步：为E-mail 地址创建一个正则表达式  \n",
    "第3步：在剪贴板文本中找到所有匹配    \n",
    "第4步：所有匹配连接成一个字符串，复制到剪贴板"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No phone numbers or email addresses found.\n"
     ]
    }
   ],
   "source": [
    "#! python3\n",
    "# Finds phone numbers and email addresses on the clipboard.\n",
    "# text:\n",
    "'''\n",
    "Contact Us\n",
    "\n",
    "No Starch Press, Inc.\n",
    "245 8th Street\n",
    "San Francisco, CA 94103 USA\n",
    "Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)\n",
    "Fax: +1 415.863.9950\n",
    "\n",
    "Reach Us by Email\n",
    "\n",
    "General inquiries: info@nostarch.com\n",
    "Media requests: media@nostarch.com\n",
    "Academic requests: academic@nostarch.com (Please see this page for academic review requests)\n",
    "Help with your order: info@nostarch.com\n",
    "Reach Us on Social Media\n",
    "Twitter\n",
    "Facebook\n",
    "Instagram\n",
    "Pinterest\n",
    "100.420.7240 x 123\n",
    "'''\n",
    "\n",
    "import pyperclip\n",
    "import re\n",
    "\n",
    "# 为电话号码创建一个正则表达式\n",
    "# 电话号码 phone\n",
    "phoneRegex = re.compile(r'''(\n",
    "        (\\d{3}|\\(\\d{3}\\))? # area code\n",
    "        (\\s|-|\\.)? # separator\n",
    "        (\\d{3}) # first 3 digits\n",
    "        (\\s|-|\\.) # separator\n",
    "        (\\d{4}) # last 4 digits\n",
    "        (\\s*(ext|x|ext.)\\s*(\\d{2,5}))? # extension\n",
    "        )''', re.VERBOSE)\n",
    "\n",
    "# 为E-mail 地址创建一个正则表达式\n",
    "# Create email regex. 邮件\n",
    "emailRegex = re.compile(r'''(\n",
    "    [a-zA-Z0-9._%+-]+ # username\n",
    "    @ # @ symbol\n",
    "    [a-zA-Z0-9.-]+ # domain name\n",
    "    (\\.[a-zA-Z]{2,4}) # dot-something\n",
    "    )''', re.VERBOSE)\n",
    "# TODO: Create email regex.\n",
    "# TODO: Find matches in clipboard text.\n",
    "# TODO: Copy results to the clipboard.\n",
    "\n",
    "\n",
    "# Find matches in clipboard text.\n",
    "# 获得剪切板文件\n",
    "text = str(pyperclip.paste())\n",
    "matches = []\n",
    "a = phoneRegex.findall(text)\n",
    "for groups in phoneRegex.findall(text):\n",
    "    # 读取数字\n",
    "    phoneNum = '-'.join([groups[1], groups[3], groups[5]])\n",
    "    # 如果没有分号\n",
    "    if groups[8] != '':\n",
    "        phoneNum += ' x' + groups[8]\n",
    "    matches.append(phoneNum)\n",
    "for groups in emailRegex.findall(text):\n",
    "    matches.append(groups[0])\n",
    "\n",
    "# Copy results to the clipboard.\n",
    "if len(matches) > 0:\n",
    "    pyperclip.copy('\\n'.join(matches))\n",
    "    print('Copied to clipboard:')\n",
    "    print('\\n'.join(matches))\n",
    "else:\n",
    "    print('No phone numbers or email addresses found.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 强口令检测  \n",
    "写一个函数，它使用正则表达式，确保传入的口令字符串是强口令。强口令的定义是：长度不少于8 个字符，同时包含大写和小写字符，至少有一位数字。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import pyperclip\n",
    "\n",
    "def detect(text):\n",
    "    if len(text)<8:\n",
    "        return False\n",
    "    test1=re.compile(r'([0-9])')\n",
    "    result1=test1.search(text)\n",
    "    if result1==None:\n",
    "        return False\n",
    "            \n",
    "    test2=re.compile(r'([A-Z])')\n",
    "    result2=test2.search(text)\n",
    "    if result2==None:\n",
    "        return False       \n",
    "    \n",
    "    test3=re.compile(r'([a-z])')\n",
    "    result3=test3.search(text)\n",
    "    if result3==None:\n",
    "        return False\n",
    "    \n",
    "    return True        \n",
    "            \n",
    "text='uusssZmi0546'\n",
    "status=detect(text)\n",
    "print(status)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 strip()的正则表达式版本\n",
    "写一个函数，它接受一个字符串，做的事情和strip()字符串方法一样。如果只传入了要去除的字符串，没有其他参数，那么就从该字符串首尾去除空白字符。否\n",
    "则，函数第二个参数指定的字符将从该字符串中去除。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BaconSpamEgg\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "def my_strip(text,param=''):\n",
    "    if len(param)<1:\n",
    "        param='\\s'\n",
    "    params_begin= r'^[{}]*'.format(param)\n",
    "    params_end= r'[{}]*$'.format(param)\n",
    "    my_begin=re.compile(params_begin,re.I)\n",
    "    my_end=re.compile(params_end,re.I)\n",
    "    my=my_begin.sub('', text)\n",
    "    text=my_end.sub('', my)\n",
    "    return text\n",
    "text='SpamSpamBaconSpamEggsSpamSpam'\n",
    "param='SPAM'    \n",
    "text=my_strip(text,'SPAM')\n",
    "print(text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
